Abstract

A methodology is presented to develop and analyze vectors of
data quality attribute scores. Each data quality vector component
represents the quality of the data element for a specific
attribute (e.g., age of data). Several methods for aggregating
the components of data quality vectors to derive one data quality
indicator (DQI) that represents the total quality associated
with the input data element are presented with illustrative
examples. The methods are compared and it is proven that the
measure of central tendency, or arithmetic average, of the data
quality vector components as a percentage of the total quality
range attainable is an equivalent measure for the aggregate DQI.
In addition, the methodology is applied and compared to real-world
LCA data pedigree matrices. Finally, a method for aggregating
weighted data quality vector attributes is developed
and an illustrative example is presented. This methodology
provides LCA practitioners with an approach to increase the
precision of input data uncertainty assessments by selecting
any number of data quality attributes with which to score the
LCA inventory model input data. The resultant vector of data
quality attributes can then be analyzed to develop one aggregate
DQI for each input data element for use in stochastic LCA
modeling.


Keywords: Data quality vector; LCA input data quality; LCI
input data quality; Life Cycle Assessment; Life Cycle Inventory;
stochastic LCA modeling


1 Introduction 

The methodology for developing stochastic LCA models
presented in Kennedy, Montgomery, and Quay (1996) uses
a single rating to measure the overall quality of each data
element. This rating is based on a sliding scale of one to
five, with a one representing the worst quality case, i.e.,
maximum uncertainty, and a five representing the best quality
case, i.e., minimum uncertainty. This same concept is
now expanded to enable the LCA practitioner to evaluate
individual input data elements using a multitude of quality
attributes. The attribute ratings, or scores, for each data
element become the components of a data quality vector.
These vectors are analyzed to derive aggregate quality ratings
for each input data element to support stochastic LCA
modeling.

There are cases where an aggregated assessment of input data
quality is not readily available, or is not advisable because
it discriminates too coarsely between levels of data quality.
For these instances, an approach similar in concept to the
pedigree matrix approach for assessing LCA model input
data quality as discussed by Weidema and Wesnoes (1995)
may be preferable. The method must provide LCA practitioners
with a means to state their judgment about the quality
of the input data over a limitless array of discrete data
quality considerations. Although Weidema and Wesnoes
stated that the scores in the pedigree matrix are semi-quantitative
in that they serve as identification numbers only
and should not be aggregated, these values do relate to the
overall assessment as presented by Funtowicz and Ravetz
(1990) in the Numeral, Unit, Spread, Assessment, and Pedigree
(NUSAP) notation. In the absence of information other
than the numeral (i.e., no probability distribution or measure
of variance (spread)), either an overall assessment needs
to be created to apply the stochastic modeling approach to
LCA inventory analyses, or a technique similar to the pedigree
matrix that is designed to enable the aggregation of
the elements for an overall assessment is needed.

An alternative method to describe the uncertainty in LCA
input data elements involves the creation of a 1 x n data
quality vector, q. The components of q are established in a
similar manner as the single-valued data quality indicator
presented in Kennedy et al. The difference is that the components
of q convert LCA practitioner qualitative judgments
about specific data quality "attributes" to quantitative indices.
Some of the typical attributes LCA practitioners consider
were discussed in Kennedy et al. These included data
age, accuracy, completeness, and representativeness of the
total population/process, and frequency of collection/quantity
of data collected. Of course, this list of typical attributes
is not all inclusive and can vary between LCA practitioners.
For example, Weidema and Wesnoes include geographical
correlation and technological correlation as data quality
factors. As the LCA methodology continues to mature
and databases are further developed, additional data quality
descriptors may be warranted. To accommodate such
advances in LCA technology, there is no limit on the number
of components, n, in the vector, q.

The DQI development methodology presented here enables
the establishment of a data quality vector of any size and
an evaluation of the vector to determine the amount of aggregate
uncertainty it represents. Analyzing the data quality
vector using this methodology results in a single-valued
indicator representing the aggregate uncertainty associated
with the input data element. The aggregate uncertainty indicator
maps directly to the DQIs developed in Kennedy et
al. and reproduced in Table 1 for ease of reference.

The beta distribution parameters specified in Table 1 are
used to generate random variables for input data in
stochastic LCA inventory models. This, in turn, enables
making multiple simulation runs of the LCA inventory
model to produce results that can be compared using statistical
methods.

In the absence of information about the actual input data
probability distribution, the beta probability distribution is
reasonable to use for several reasons discussed in Kennedy
et al. The beta distribution enables the use of range endpoints
and two shape parameters, α and β, that determine the mean
and variance (i.e., spread) of the distribution. As α and β
decrease from five to one in Table 1, the shape of the distribution
becomes flatter, indicating higher probability that
values closer to the range endpoints will occur for the input
data for each run of the stochastic LCA inventory model.
Of course, range endpoints resulting from higher percentages
of the input data values also indicate greater uncertainty
(i.e., lower input data quality) because the input data
can assume values over a wider interval.


Table 1: Beta probability distribution parameters for DQIs (baseline, sensitivity level 1, and sensitivity level 2)

Baseline:

Data Quality Indicator    Shape Parameters (α,β)    Range Endpoints (±%)
5                         5,5                       10
4.5                       4,4                       15
4                         3,3                       20
3.5                       2,2                       25
3                         1,1                       30
2.5                       1,1                       35
2                         1,1                       40
1.5                       1,1                       45
1                         1,1                       50

Sensitivity Level 1 (SENS L-1):

Data Quality Indicator    Shape Parameters (α,β)    Range Endpoints (±%)
5                         4,4                       20
4.5                       3,3                       25
4                         2,2                       30
3.5                       1,1                       35
3                         1,1                       40
2.5                       1,1                       45
2                         1,1                       50
1.5                       1,1                       50
1                         1,1                       50

Sensitivity Level 2 (SENS L-2):

Data Quality Indicator    Shape Parameters (α,β)    Range Endpoints (±%)
5                         3,3                       30
4.5                       2,2                       35
4                         1,1                       40
3.5                       1,1                       45
3                         1,1                       50
2.5                       1,1                       50
2                         1,1                       50
1.5                       1,1                       50
1                         1,1                       50

The sensitivity levels in Table 1 (i.e., Sensitivity Level 1
(SENS L-1) and Sensitivity Level 2 (SENS L-2)) indicate
increasing input data uncertainty as the levels increase. These
beta distribution parameters are used when the LCA practitioner
is conducting sensitivity analyses to provide the LCA
inventory information users an understanding of the sensitivity
of the results to under- and overestimation of the
input data quality.
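
To make the use of Table 1 concrete, the following is a minimal sketch of drawing one random input value for a datum with a given aggregate DQI. It assumes that the range endpoints are applied symmetrically as a percentage of the datum's value and that the datum is positive; the names `BASELINE` and `draw_input` are illustrative and do not come from Kennedy et al.

```python
import random

# Baseline rows of Table 1: DQI -> ((alpha, beta), range endpoint as a fraction).
BASELINE = {
    5.0: ((5, 5), 0.10), 4.5: ((4, 4), 0.15), 4.0: ((3, 3), 0.20),
    3.5: ((2, 2), 0.25), 3.0: ((1, 1), 0.30), 2.5: ((1, 1), 0.35),
    2.0: ((1, 1), 0.40), 1.5: ((1, 1), 0.45), 1.0: ((1, 1), 0.50),
}


def draw_input(value, dqi, rng=random, table=BASELINE):
    """Draw one random value for a positive datum with the given aggregate DQI.

    Assumption: the range endpoints are value * (1 - pct) and value * (1 + pct).
    """
    (alpha, beta), pct = table[dqi]
    low, high = value * (1.0 - pct), value * (1.0 + pct)
    # betavariate returns a number in [0, 1]; rescale it onto [low, high].
    return low + (high - low) * rng.betavariate(alpha, beta)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs are comparable
    print([round(draw_input(100.0, 3.0, rng), 1) for _ in range(5)])
```

A sensitivity run would pass a different table (for example, one built from the SENS L-1 rows) through the `table` argument while leaving the DQIs unchanged.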

The methodology is presented in two sections. The first
addresses the development of data quality vectors. The second
section presents three data quality vector analysis methods
with illustrative examples and a proof of the equivalence
of these methods. The aggregation methodology is
applied to real-world LCA input data pedigrees that have
similar features to the data quality vector. Comparisons of
the methods are discussed, which leads to the development
and analysis of a weighted data quality vector. Finally, some
concluding remarks are presented.

2 Methodology 

2.1 Data quality vector development 

The process of developing data quality vectors begins with
the selection of the data quality attributes with which to
score the data. This can be done either by a single LCA
practitioner or in a group setting. Many group decision techniques,
e.g., Nominal Group Technique and the Delphi
method (see Goicoechea, Hansen and Duckstein, 1982),
are available to support this part of the process. As LCA
techniques continue to mature and more research is accomplished
in the study of input data quality, evaluation standards
should emerge that standardize the quality attributes
to be applied or at least provide a comprehensive set from
which LCA practitioners can select.

After the selection of the data quality attributes, the scoring
process is similar to that presented in Kennedy et al. All
LCA model input data are individually scored on the same
sliding scale for each data quality attribute. The result is a
vector of quality indices that represents the practitioner's
(or practitioners') judgment regarding the uncertainty
associated with the data element. The data quality vector is
denoted as q such that $\{q_i : i = 1,2,\ldots,n\}$ is the set of
n data quality attributes, each of which can take on a data
quality index value in the range $1 \le q_i \le 5$. Note that
the scores are not restricted to integer values.


Independence is maintained between assessment scores
within the data quality vector. Quality assessment correlation
that may exist between individual input data elements,
e.g., data having the same age, is not a concern since the
overall assessments are accomplished within individual data
elements and not across data elements. Correlation between
certain input data elements, e.g., chemical stoichiometry
and mass balance considerations, does need to be analyzed
and accounted for in the stochastic modeling approach, just
as it is in the deterministic models.


2.2 Data quality vector analysis for aggregate DQI 
assignment 

To implement the stochastic LCA inventory modeling methodology
presented in Kennedy et al., an aggregate DQI must
be derived for each data quality vector for those input data
elements with no prior probability distribution information.
The aggregate DQI must capture the input data element
uncertainty represented by the applicable data quality vector.
Several methods are presented to accomplish this analysis.
These include linear programming, vector projection,
and expected value. Each results in a percentage of maximum
attainable quality represented by a given data quality
vector.

A proof is presented that provides LCA practitioners the
assurance that each of these analysis approaches produces
equivalent results. Therefore, LCA practitioners are free to
choose the analytical method they find most practical and
informative for their particular application.

DQIs are assigned to each LCA input data element according
to the percentage of the maximum attainable quality
value each represents. The DQIs are assigned as indicated
in Table 2. Once the DQIs are assigned to each input data
element, the LCA practitioner has the information needed
to use Table 1 to select parameters for the associated beta
probability distributions to implement the stochastic LCA
modeling methodology. Of course, LCA practitioners should
select parameters for the beta, or other applicable input
data probability distributions, as appropriate to adequately
model any "known" uncertainty regarding individual data
elements. For example, if a particular data element should
have a skewed distribution so that values closer to one range
endpoint are more likely than those near the other, the appropriate
beta distribution shape parameters should be chosen
independently of Table 1.

2.2.1 Linear programming (LP) method 

Linear programming (LP) can be used to evaluate the percent
of the maximum quality function attained by the data
quality vector of interest. Wu and Coppin (1981), Hadley
(1962), and Hillier and Lieberman (1980) are among many
authoritative references for details on LP. The first step in
this process requires the LCA practitioner to formulate the
LP problem.

The LP formulation involves the development of the quality
objective function to be maximized and the constraints
associated with the decision variables. In this application,
the decision variables can simply be thought of as the quality
index values that can be assigned to any of the data
quality attributes.

The quality objective function is defined as:

$$Q = \sum_{i=1}^{n} q_i \qquad (1)$$

and it is subject to the following constraint space:

$$q_i \ge 1, \quad q_i \le 5 \qquad \text{for } i = 1, 2, \ldots, n$$

Note that the $q_i \ge 1$ constraint makes the non-negativity
constraint required in LP problem formulation redundant.

The solution to the LP can be accomplished by manually
applying the simplex method or through graphical methods
if the problem is small enough. A number of commercially
available software packages with embedded LP problem
solvers, such as Microsoft® Excel (see Microsoft® Excel
Version 5.0 User's Guide 1994), and Statgraphics® (see
Statgraphics® Reference Manual Version 5 1991), can be
used to find the maximum of the quality function. The total
number of data quality attributes, n, determines the dimension
of the Euclidean space ($E^n$) that contains the feasible
region. Problems of this type in $E^2$ can be solved using a
graphical representation of the LP problem. The LP solution
provides the maximum and minimum attainable quality
function value (i.e., max Q and min Q).

Of course, for straightforward quality objective functions
such as equation (1), the maximum can be readily identified
without applying formal LP solution methods. By simple
inspection, it can be noted that the maximum attainable
quality function value is when all n components have
the maximum assignable score. In the case of the one
through five sliding scale applied in this paper, where a five
is the best assignable score, the maximum attainable
unweighted quality function value is 5n.

The next step in the LP method is to determine what percentage
of the maximum attainable data quality has been
achieved by each data quality vector. The quality function
value for each data quality vector must be calculated. The
resultant value represents one of an infinite number of parallel
iso-quality lines (in $E^2$) or parallel iso-quality cutting
planes (in $E^n$, n > 2). The percent of attainable quality
represented by each data quality vector is calculated as follows:

$$\%\ \text{of attainable quality} = \frac{Q - \min Q}{\max Q - \min Q} \times 100 \qquad (2)$$

where min Q is subtracted from Q in the numerator and from
max Q in the denominator to account for the true percent
of attainable quality since min Q > 0.

2.2.1.1 Illustrative examples 

Consider the case of a data quality vector q containing n = 2
data quality attributes. The age of the data might be represented
by $q_1$ and data representativeness by $q_2$. The formulation
of the LP problem becomes (from equation (1)):

Maximize $Q = q_1 + q_2$
subject to:

$q_1 \ge 1$
$q_1 \le 5$
$q_2 \ge 1$
$q_2 \le 5$

The graph of this formulation in $E^2$ is shown in Figure 1.
The feasible region is the square bounded by the four
constraint equations, boundary included. Several iso-quality
lines have been added to indicate that as the data quality
attribute indices improve from worst case (1,1) to best
case (5,5), the parallel iso-quality lines continuously improve
in value. Q = 10 is maximized at the upper right
corner of the feasible region (5,5). As would be expected by
simple inspection of the problem formulation, this is the
only point that remains feasible to maximize Q. Note that
Q = 2 is at a minimum at the lower left corner point (1,1).



Fig. 1: Feasible region and iso-quality lines of a two attribute
unweighted data quality vector
Evaluating this LP using the Simplex method in Statgraphics®
provides the same solution after four pivots. The
solution also provides the shadow prices, or marginal values,
for $q_1$ and $q_2$ as 1.0 each. This indicates that the quality
function value increases by 1.0 for each corresponding unit
increase in $q_1$ or $q_2$. This is increasingly useful information
as the complexity of quality objective functions increases.

The percent of the maximum attainable quality (x) for an
example data quality vector q = (4,2) is determined as follows
from equation (1): Q = 4 + 2 = 6

From equation (2):

$$x = \%\ \text{of attainable quality} = \frac{6 - 2}{10 - 2} \times 100$$
x = 50%
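
The calculation in equations (1) and (2) can be sketched in a few lines. This is a minimal illustration rather than a general LP solver: it relies on the observation made above that, with only upper and lower bounds on each score, the extremes of Q can be found by inspection at a corner of the feasible region. The function names are illustrative.

```python
def box_lp_extremes(coeffs, lower=1.0, upper=5.0):
    """Return (max Q, min Q) of sum(c_i * q_i) subject to lower <= q_i <= upper.

    With only box constraints the optimum sits at a corner of the feasible
    region, so each q_i is pushed to one bound depending on the sign of c_i.
    """
    max_q = sum(c * (upper if c >= 0 else lower) for c in coeffs)
    min_q = sum(c * (lower if c >= 0 else upper) for c in coeffs)
    return max_q, min_q


def percent_attainable(q, lower=1.0, upper=5.0):
    """Equation (2) for an unweighted data quality vector q."""
    if not all(lower <= qi <= upper for qi in q):
        raise ValueError("every score must lie between lower and upper")
    max_q, min_q = box_lp_extremes([1.0] * len(q), lower, upper)
    return (sum(q) - min_q) / (max_q - min_q) * 100.0


if __name__ == "__main__":
    print(box_lp_extremes([1.0, 1.0]))  # (10.0, 2.0)
    print(percent_attainable((4, 2)))   # 50.0
```

For objective functions with additional constraints linking the scores, a real LP solver would be needed; the corner-by-inspection shortcut only holds for the bound constraints used here.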

Therefore, the aggregate DQI for the (4,2) data quality vector
is 3 from Table 2. Then, from Table 1, stochastic LCA
model input data with this aggregate DQI of 3 would be
replaced with random variables drawn from a beta probability
distribution with range endpoints specified at ±30%
of each input datum's value and with shape parameters α = 1
and β = 1. These particular beta distribution shape parameters
transform the beta distribution into a uniform probability
distribution over the range of possible input data
values (i.e., ±30% of the original value in this instance).

Table 3 contains the results of an evaluation of all the combinations
of the 1 x 2 integer-valued data quality vectors.
Permutations of the vector components are not included
because the ordering of vector components in the case of
evenly weighted data quality attributes does not alter the
result. For instance, the example presented above evaluates
the aggregate DQI for the vector q = (4,2) as 3 (50%). The
evaluation of data quality vector q = (2,4) shown in Table 3
indicates that the same result is obtained.


Table 2: DQI assignment matrix

Achieved Percent (x) of Maximum Attainable Quality Value    Data Quality Indicator (DQI)
0 ≤ x < 12.5                                                 1
12.5 ≤ x < 25                                                1.5
25 ≤ x < 37.5                                                2
37.5 ≤ x < 50                                                2.5
50 ≤ x < 62.5                                                3
62.5 ≤ x < 75                                                3.5
75 ≤ x < 87.5                                                4
87.5 ≤ x < 100                                               4.5
x = 100                                                      5


Table 3: All cases of a 1 x 2 integer-valued data quality vector (max Q = 10; min Q = 2)

Data Quality Vector    Q     % of attainable max Q    Aggregate DQI
(1 1)                  2     0                        1
(1 2)                  3     12.5                     1.5
(1 3)                  4     25                       2
(1 4)                  5     37.5                     2.5
(1 5)                  6     50                       3
(2 2)                  4     25                       2
(2 3)                  5     37.5                     2.5
(2 4)                  6     50                       3
(2 5)                  7     62.5                     3.5
(3 3)                  6     50                       3
(3 4)                  7     62.5                     3.5
(3 5)                  8     75                       4
(4 4)                  8     75                       4
(4 5)                  9     87.5                     4.5
(5 5)                  10    100                      5
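
The DQI assignment of Table 2 and the enumeration behind Table 3 can be reproduced with a short sketch. It assumes that each 12.5-point band includes its lower endpoint, which is the reading consistent with the Table 3 entries (for example, 12.5% maps to a DQI of 1.5). The function names are illustrative.

```python
import math
from itertools import combinations_with_replacement


def percent_attainable(q, lower=1.0, upper=5.0):
    """Equation (2) with max Q = n * upper and min Q = n * lower."""
    n = len(q)
    return (sum(q) - n * lower) / (n * upper - n * lower) * 100.0


def assign_dqi(x):
    """Table 2 lookup: 12.5-point bands starting at DQI 1, with 100% -> 5.

    Assumption: each band includes its lower endpoint.
    """
    if not 0.0 <= x <= 100.0:
        raise ValueError("x must be a percentage between 0 and 100")
    band = math.floor(round(x, 9) / 12.5)  # rounding guards against float noise
    return min(5.0, 1.0 + 0.5 * band)


def table3():
    """All 1 x 2 integer-valued vectors, ignoring permutations."""
    rows = []
    for q in combinations_with_replacement(range(1, 6), 2):
        x = percent_attainable(q)
        rows.append((q, sum(q), x, assign_dqi(x)))
    return rows


if __name__ == "__main__":
    for q, total, x, dqi in table3():
        print(q, total, x, dqi)
```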
A data quality vector containing three attributes is represented
in $E^3$. In this case, the feasible region is a cube,
boundary included, as illustrated in Figure 2. The iso-quality
functions are represented as parallel cutting planes. These planes
cut through the feasible region and each represents a feasible
solution for those data quality vectors contained within
the feasible region. All points on the iso-quality plane represent
the same quality function value. For evenly weighted
data quality attributes, the cutting planes are perpendicular
to the optimal data quality vector (5,5,5) just as the iso-quality
lines in the two data quality attribute case are perpendicular
to the optimal data quality vector (5,5).



Fig. 2: Feasible region and iso-quality planes of a three attribute
unweighted data quality vector


2.2.2 Vector projection method 

An alternate approach is to compute the percent of the
magnitude of the optimal data quality vector, u, covered by
the magnitude of the projection, p, of the data quality vector
of interest, q, on u. The percentage of u covered by p in
$E^n$ is defined as follows:

$$\%\ \text{of attainable quality} = \left(1 - \frac{\|\mathbf{u}\| - \|\mathbf{p}\|}{\|\mathbf{u}\|}\right) \times 100 \qquad (3)$$

where Protter and Morrey (1964) derive $\|\mathbf{u}\|$ and $\|\mathbf{p}\|$ as:

$$\|\mathbf{u}\| = (\mathbf{u} \cdot \mathbf{u})^{1/2} \quad \text{and} \quad \|\mathbf{p}\| = (\mathbf{p} \cdot \mathbf{p})^{1/2}$$

Consider the maximum magnitude the 1 x n data quality
vector, maxq, can assume. In this case, $\max q_i = 5$ for i =
1,2,...,n. In the same manner, the minimum magnitude the
1 x n data quality vector, minq, can assume is when $\min q_i = 1$
for i = 1,2,...,n. Figure 3 illustrates the $E^2$ feasible region
with the minimum and maximum magnitude vectors, minq
and maxq, displayed. An example data quality vector,
q = (3,2), is also shown. As indicated in Figure 3, minq projects
directly on maxq. The optimal vector, u, is defined as the
difference between maxq and minq. Therefore, u is a 1 x n
vector and $u_i = 4$ for i = 1,2,...,n.


Fig. 3: Vector projection of a two attribute unweighted data quality
vector


The projection, p, of a vector, b, on a as derived by Fraleigh
and Beauregard (1987) is $\frac{\mathbf{a} \cdot \mathbf{b}}{\mathbf{a} \cdot \mathbf{a}} \times \mathbf{a}$. Therefore, for any data
quality vector, q, its projection, p, on u is $\frac{\mathbf{u} \cdot \mathbf{q}}{\mathbf{u} \cdot \mathbf{u}} \times \mathbf{u}$. Thus,

$$\mathbf{p} = \frac{\sum_{i=1}^{n} u_i q_i}{\sum_{i=1}^{n} u_i^2} \times \mathbf{u} = \frac{\sum_{i=1}^{n} 4 q_i}{16n} \times \mathbf{u} = \frac{1}{16n} \sum_{i=1}^{n} 4 q_i \times \mathbf{u}$$

Each element of p is:

$$p_i = \frac{4}{16n} \sum_{i=1}^{n} 4 q_i = \frac{16}{16n} \sum_{i=1}^{n} q_i = \frac{1}{n} \sum_{i=1}^{n} q_i = \bar{q} \qquad \text{for } i = 1, 2, \ldots, n$$

p must be adjusted by subtracting minq so that it is 100%
relative to u, thus,

$$p_{a_i} = p_i - 1 = \bar{q} - 1 \qquad \text{for } i = 1, 2, \ldots, n$$

$$\|\mathbf{p}_a\| = \sqrt{\sum_{i=1}^{n} (\bar{q} - 1)^2} = \sqrt{n(\bar{q} - 1)^2} = (\bar{q} - 1)\sqrt{n}$$

$$\|\mathbf{u}\| = \sqrt{n \cdot 4^2} = 4\sqrt{n}$$

From equation (3) and substituting $\mathbf{p}_a$ for p:

$$\%\ \text{of attainable quality} = \left(1 - \frac{4\sqrt{n} - (\bar{q} - 1)\sqrt{n}}{4\sqrt{n}}\right) \times 100$$

$$= \left(1 - \frac{[4 - (\bar{q} - 1)]\sqrt{n}}{4\sqrt{n}}\right) \times 100 = \left(1 - \frac{[4 - \bar{q} + 1]}{4}\right) \times 100$$

$$= \left(\frac{\bar{q} - 1}{4}\right) \times 100 = \left(\frac{1}{4n} \sum_{i=1}^{n} q_i - \frac{1}{4}\right) \times 100 \qquad (4)$$


2.2.2.1 Illustrative example


Consider the example data quality vector, q = (3,2), illustrated
in Figure 3. The percent of attainable quality achieved is:

$$\%\ \text{of attainable quality} = \left(\frac{(3 + 2)}{4(2)} - \frac{1}{4}\right) \times 100 = 37.5\%$$

This represents an aggregate DQI of 2.5 from Table 2. Note
that this is the same result as presented in Table 3 for the
data quality vector q = (2,3).


2.2.3 Expected value method

The first step in this method is to find the expected value,
$\bar{q}$, of the vector's components. Then, the percent of attainable
quality is determined by computing the percent of the
range of aggregate data quality indicators, i.e., $\max q_i - \min q_i$,
that $\bar{q}$ represents. Since the range of values the $q_i$ can
assume is $1 \le q_i \le 5$, the quality range is 4. Also, similar to
the LP and vector projection methods, $\bar{q}$ must be adjusted
by subtracting $\min q_i$ to account for the true percent of
quality range since $\min q_i > 0$.

$$E(q) = \bar{q} = \frac{1}{n} \sum_{i=1}^{n} q_i \qquad (5)$$

From equation (5):

$$\%\ \text{of attainable quality} = \frac{\bar{q} - 1}{\text{range}} \times 100 = \frac{\frac{1}{n} \sum_{i=1}^{n} q_i - 1}{4} \times 100 = \left(\frac{1}{4n} \sum_{i=1}^{n} q_i - \frac{1}{4}\right) \times 100 \qquad (6)$$


2.2.3.1 Illustrative example


Consider the n = 3 component example of a data quality
vector, q = (2.4, 5, 4.1). The percent of attainable quality
achieved is, from equation (6):

$$x = \%\ \text{of attainable quality} = \left(\frac{[2.4 + 5 + 4.1]}{4(3)} - \frac{1}{4}\right) \times 100$$
x = 70.83%

This represents an aggregate DQI of 3.5 from Table 2.


2.3 Proof of equivalence of analysis methods


It is obvious from the development of the vector projection
and expected value analysis methods that both result in the
same aggregate value (reference equations (4) and (6)).
However, the equivalence of the LP analysis method is not
readily apparent without further development. The basic
structure of the LP formulation enables the use of an equivalent
numerical approach to determine the aggregate DQI
from equation (2) as follows:

$$\%\ \text{of attainable quality} = \frac{\sum_{i=1}^{n} q_i - \sum_{i=1}^{n} 1}{\sum_{i=1}^{n} 5 - \sum_{i=1}^{n} 1} \times 100 = \frac{\sum_{i=1}^{n} q_i - n}{5n - n} \times 100$$

$$= \frac{\sum_{i=1}^{n} q_i - n}{4n} \times 100 = \left(\frac{1}{4n} \sum_{i=1}^{n} q_i - \frac{1}{4}\right) \times 100 \qquad (7)$$


Equations (4), (6), and (7) are all identical, proving that
any of these data quality vector analysis methods result in
the same DQI assignment. Note that these are for the rating
scale of one to five with five being the best value. The
analysis methods are easily adapted to any rating scale the
LCA practitioner prefers.
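
The equivalence can also be checked numerically. The sketch below implements each of the three methods separately for the one-to-five scale, with the projection method written out as a projection of q on u = maxq - minq followed by the minq adjustment, and then compares the results. It is an illustration under those assumptions, with illustrative function names, and not a substitute for the proof.

```python
import math


def lp_percent(q):
    """LP method, equation (2), with max Q = 5n and min Q = n."""
    n = len(q)
    max_q, min_q = 5.0 * n, 1.0 * n
    return (sum(q) - min_q) / (max_q - min_q) * 100.0


def norm(v):
    """Euclidean magnitude, (v . v) ** 0.5."""
    return math.sqrt(sum(x * x for x in v))


def projection_percent(q):
    """Vector projection method: project q on u, then subtract minq."""
    u = [4.0] * len(q)  # u = maxq - minq on the one-to-five scale
    scale = sum(ui * qi for ui, qi in zip(u, q)) / sum(ui * ui for ui in u)
    p = [scale * ui for ui in u]    # projection of q on u
    p_adj = [pi - 1.0 for pi in p]  # adjust by subtracting minq
    return (1.0 - (norm(u) - norm(p_adj)) / norm(u)) * 100.0


def expected_value_percent(q):
    """Expected value method: mean score as a share of the quality range."""
    mean = sum(q) / len(q)
    return (mean - 1.0) / 4.0 * 100.0


if __name__ == "__main__":
    for q in [(3, 2), (4, 2), (2.4, 5, 4.1)]:
        results = (lp_percent(q), projection_percent(q), expected_value_percent(q))
        print(q, [round(r, 2) for r in results])
```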


3 Application of Analysis Methodology to 
Real-World LCA Data Pedigrees 

This method is also applied to the pedigrees presented in
Weidema and Wesnoes. The authors present 27 pedigrees
for the data contributing to the energy consumption for
crop production in the unallocated life cycle of rye bread.
They also present the basic coefficient of variation (CV) for
each datum and a modified CV (and mean, μ) for each datum
after incorporating additional uncertainties relating to
the datum's pedigree.

Table 4 presents a subset of this data along with the comparable
results that would be attained by applying the data
quality vector analysis methodology to the pedigrees. The
pedigrees were established based on the same scoring scale
of one to five; however, the scale is reversed. A one represents
the best case and a five the worst. Therefore, the pedigree
scores presented in Table 4 are transformed for applicability
to the aggregation methodology presented here. For
example, a one becomes a five and vice versa, a two becomes
a four and vice versa, and a three remains a three.
Also, an additional column has been added to the Weidema
and Wesnoes information by applying a ±3s to the mean
based on $CV = s/\bar{x}$, so that $s = CV \cdot \bar{x}$. Assuming a normal
distribution applies, this is meant to provide an indication of the
percentage of the mean that captures the majority (99.7%)
of the values the input data may assume. A similar indicator
from the beta probability distribution parameterization
presented in Kennedy et al. and reproduced in Table 1 is
applied to the aggregate DQI for all sensitivity levels.
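
The two transformations described above are simple enough to sketch. The code assumes the reversal rule stated in the text (a score s becomes 6 - s on a one-to-five scale) and the relation s = CV times the mean for the ±3s column; the function names and the raw pedigree used in the example are illustrative, the latter being the raw scores implied by a transformed row of (4 5 5 5 4).

```python
def transform_pedigree(pedigree, best=1, worst=5):
    """Reverse a pedigree scored with 1 as best onto the scale with 5 as best."""
    return tuple(best + worst - score for score in pedigree)


def three_sigma_fraction(cv_percent):
    """Return 3s as a fraction of the mean, using CV = s / mean."""
    return 3.0 * cv_percent / 100.0


def aggregate_percent(q):
    """Percent of attainable quality for a vector on the one-to-five scale."""
    return (sum(q) / len(q) - 1.0) / 4.0 * 100.0


if __name__ == "__main__":
    q = transform_pedigree((2, 1, 1, 1, 2))
    print(q)                          # (4, 5, 5, 5, 4)
    print(three_sigma_fraction(10))   # fraction of the mean for a modified CV of 10%
    print(aggregate_percent(q))       # falls in the 87.5 to 100 band of Table 2
```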


Table 4: Application of data quality vector analysis methodology to the data presented in Weidema and Wesnoes (1995) using transformed
pedigree matrices

Columns 2 to 5 are the Weidema and Wesnoes (1995) data (note: the ±3s column is derived from the modified CV). Columns 6 to 9 are the
results of applying the data quality vector analysis methodology to the "transformed" data quality index. Entries of the form ±.30x̄ are
fractions of the mean, x̄.

Data    Basic CV (%)    Data Quality Index    Modified CV (%) (and μ)    ±3s (a)     Aggregate DQI    Baseline Case (α,β) (b)    SENS L-1 (α,β) (b)    SENS L-2 (α,β) (b)
1       1               (4 5 5 5 4)           10                         ±.30x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
2       1               (5 5 5 5 5)           1                          ±.03x̄       5                ±.10x̄ (5,5)                ±.20x̄ (4,4)           ±.30x̄ (3,3)
3       19              (5 5 5 4 5)           25                         ±.75x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
4       8               (5 5 5 4 5)           12                         ±.36x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
5       9               (5 5 5 4 5)           13                         ±.39x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
6       6               (5 5 5 4 5)           18                         ±.54x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
7       10              (5 5 5 4 5)           20                         ±.60x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
8       [missing]       (5 5 5 3 5)           25                         ±.75x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
9       14              (5 5 5 4 5)           22                         ±.66x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
10      11              (5 5 5 4 5)           20                         ±.60x̄       4.5              ±.15x̄ (4,4)                ±.25x̄ (3,3)           ±.35x̄ (2,2)
11      6               (4 5 5 4 2)           37                         ±1.11x̄      4                ±.20x̄ (3,3)                ±.30x̄ (2,2)           ±.40x̄ (1,1)
12      6               (4 5 5 1 1)           71                         ±2.13x̄      3                ±.30x̄ (1,1)                ±.40x̄ (1,1)           ±.50x̄ (1,1)
13      59              (3 5 1 3 2)           62 (μ 20% higher)          ±1.86x̄      2.5              ±.35x̄ (1,1)                ±.45x̄ (1,1)           ±.50x̄ (1,1)

(a) Assuming the applicable probability distribution is the normal, ±3s represents 99.7% of the values that the data element can assume
as a random variable.

(b) The shape parameters provide additional information about the likelihood of drawing random variable data values from the
extremes of the range on x̄ (see Kennedy, Montgomery and Quay (1996)). In this case, the ±% x̄ represents 100% of the values the
data element can assume as a random variable regardless of the distribution parameters.
According to Funtowicz and Ravetz (1990), the application
of aggregation methodology to the pedigrees is acceptable
given no other available information about the NUSAP
assessment of the data. However, it is important to note
that a direct one-to-one comparison in methods is not possible
because the approaches differ somewhat. The intent is
to demonstrate the usefulness of the aggregation methodology
to real-world data and to discuss those comparisons
that are relevant.

The Weidema and Wesnoes modified CV column indicates
that in instances of low input data quality, the input data
random variables can assume negative values. The authors
do not address this circumstance. Obviously, control
boundaries (i.e., range endpoints) would need to be applied
in the modeling approach to prevent the use of negative
values. The data quality vector methodology results in range
endpoint selections that ensure the input data random variables
can assume only positive values.

Many of the pedigrees in Table 4 are the same, e.g., data
elements three through seven, nine, and ten have the same
data quality index. It would seem that the same pedigree,
or data quality index, should represent the same degree of
uncertainty about the data element. This is the case with
the data quality vector methodology. However, this is not
the case with the Weidema and Wesnoes approach. Although
the authors do not address exactly why this occurs, it appears
that additional weighting or fractional data quality
assessments are being made using the pedigree as a guide.
The data quality vector methodology can account for such
fractional assessments within the vector. It can also accommodate
data quality attribute weighting with some minor
adjustments to the aggregate quality function formulations.

4 Data Quality Vectors with Weighted Data 
Quality Attributes 

In most cases, the input data quality attributes will be evenly
weighted. However, there will undoubtedly be instances when
the LCA practitioner(s) will want to weight one quality attribute
more heavily than another. For example, the energy
consumption requirements of the processing technology associated
with the bottling system alternative LCA models may
be changing quickly over time. New or modified processing
methods may be altering processing efficiencies. In this instance,
the LCA practitioner may want to weight the "age of data"
quality attribute more heavily than the other quality attributes
to account for such volatility. In the Weidema and Wesnoes
rye bread LCA example, one of the data quality goals was
that recent data was preferred over other data quality aspects
(i.e., attributes). The weighted data quality attribute method
can be used to express this preference so that the resultant
aggregate DQI reflects the appropriate additional or reduced
data uncertainty. In this way, independence is maintained between
data quality attribute judgments.

As in assigning the quality scores for each attribute, LCA
practitioners must also select the magnitude of the data
quality attribute weights using judgment. There are a
number of techniques available to assist with this process.
One of the more extensive groups of decision analysis tools
available are those associated with multiattribute utility theory
(MAUT). Goicoechea, Hansen and Duckstein (1982) provide
extensive detail on the MAUT methods as well as an assortment
of less formal group decision making techniques.

Incorporating weighted data quality attributes is a straightforward
extension of the data quality vector method. Consider
an associated weight vector w such that ($w_i : i = 1, 2, \ldots, n$)
is the set of n data quality attribute weights. The quality
function objective is now expressed as:

$$\text{Maximize } Q = \sum_{i=1}^{n} w_i q_i \qquad (8)$$

subject to (as before):

$$q_i \ge 1, \quad q_i \le 5, \quad i = 1, 2, \ldots, n$$

No special conditions on w are necessary. The $w_i$, like the
$q_i$, are real or integer values. The data quality attribute
weight vector is normally a unit vector (every $w_i = 1$) and, as
such, has no effect on the quality function. There is no reason
to select a weight of zero for a particular attribute because the
effect would be to remove that attribute and, thus, its quality
index, from further consideration. This is better accomplished
by evaluating the applicable data quality vector of n - 1 attributes.

The effect of the $w_i$ is limited to the objective function. Any
data quality attribute weight vector, other than the unit vector,
will change the slope of the objective quality function.
It has no effect on the feasible region as indicated by the
constraint set.

As before, the quality function is evaluated for individual 
weighted data quality vectors. The result is divided by the 
maximum attainable value for the weighted quality func¬ 
tion found using LP. This provides the same relative per¬ 
centage as before so that the same aggregate DQI assign¬ 
ment policy is applicable. This, in turn, enables use of the 
same beta probability distribution parameterization as before. 
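
The evaluation procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the three-attribute vector, and the weights are hypothetical, and the one-to-five scale is assumed throughout. Because each $q_i$ is bounded independently and the weights are assumed positive, the LP minimum and maximum are taken to be the values of Q at (1, ..., 1) and (5, ..., 5).

```python
SCALE_MIN = 1.0  # worst score on the sliding scale
SCALE_MAX = 5.0  # best score on the sliding scale


def weighted_quality(q, w):
    """Weighted quality function Q = sum(w_i * q_i), as in equation (8)."""
    if len(q) != len(w):
        raise ValueError("q and w must have the same length")
    return sum(wi * qi for wi, qi in zip(w, q))


def percent_of_attainable_quality(q, w):
    """Position of Q within its attainable range, as a percentage.

    Assumes every weight is positive, so Q is smallest when all scores
    equal SCALE_MIN and largest when all scores equal SCALE_MAX.
    """
    if any(wi <= 0 for wi in w):
        raise ValueError("weights are assumed to be positive")
    if any(not SCALE_MIN <= qi <= SCALE_MAX for qi in q):
        raise ValueError("scores must lie on the rating scale")
    q_min = weighted_quality([SCALE_MIN] * len(w), w)
    q_max = weighted_quality([SCALE_MAX] * len(w), w)
    return (weighted_quality(q, w) - q_min) / (q_max - q_min) * 100.0


# Hypothetical three-attribute vector; the second attribute is weighted most.
scores = [3.0, 4.0, 2.0]
weights = [1.0, 3.0, 1.0]
pct = percent_of_attainable_quality(scores, weights)
print(f"Q = {weighted_quality(scores, weights)}, percent of range = {pct}")
```

The resulting percentage is then mapped to an aggregate DQI with the same assignment policy used in the unweighted case.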

4.1 Illustrative example 

Again, consider the case of a data quality vector q containing
n = 2 data quality attributes. The associated weight vector
w is (2, 0.5). The formulation of the LP problem using
equation (8) becomes:

$$\text{Maximize } Q = w_1 q_1 + w_2 q_2 = 2 q_1 + 0.5 q_2$$

subject to:

$$q_1 \ge 1, \quad q_1 \le 5, \quad q_2 \ge 1, \quad q_2 \le 5$$

The graph of this formulation in $E^2$ is shown in Figure 4.
The feasible region is the same as was indicated in Figure 1.
Several iso-quality lines have again been added to show the
change in slope that results from the weighted quality objective
function variables. The iso-quality lines are no longer
perpendicular to the optimal vector (5,5), which indicates
that the quality value can be increased a greater amount
per unit of $q_1$ or $q_2$ by increasing the more heavily weighted
quality vector component. In this case, the upper right hand corner
of the feasible region, i.e., (5,5), indicates that Q = 12.5 is
the maximum. The minimum is at (1,1), where Q = 2.5.
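
Since the feasible region here is a box, a linear objective attains its extremes at the corners, so the result can be cross-checked without an LP solver by evaluating Q at each corner. The sketch below does this for w = (2, 0.5). It is a brute-force check rather than the Simplex method, and the helper name `box_extremes` is hypothetical.

```python
from itertools import product


def box_extremes(w, lo=1.0, hi=5.0):
    """Evaluate Q = sum(w_i * q_i) at every corner of the box lo <= q_i <= hi.

    Returns ((q_at_min, Q_min), (q_at_max, Q_max)). A linear function on a
    box attains its extremes at corners, so no LP solver is needed here.
    """
    corners = list(product((lo, hi), repeat=len(w)))
    values = [(sum(wi * qi for wi, qi in zip(w, c)), c) for c in corners]
    q_min_val, q_min_at = min(values)
    q_max_val, q_max_at = max(values)
    return (q_min_at, q_min_val), (q_max_at, q_max_val)


(low_corner, q_low), (high_corner, q_high) = box_extremes((2.0, 0.5))
print(low_corner, q_low)    # corner and value of the minimum
print(high_corner, q_high)  # corner and value of the maximum
```

Corner enumeration grows as 2^n, so it is only practical for short vectors; for the constraint set used here the extremes can equally be read off by inspection.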



Fig. 4: Feasible region and iso-quality lines of a two-attribute
weighted data quality vector (w = (2, 0.5))

Evaluating this LP using the Simplex method in Statgraphics
provides the same solution after four pivots. The
solution also provides the shadow prices, or margin values,
for $q_1$ and $q_2$ as 2.0 and 0.5, respectively. This indicates that
the quality function value increases by 2.0 for each
corresponding unit increase in $q_1$ and by 0.5 for
each corresponding unit increase in $q_2$.


The percent of the maximum attainable range of Q for an
example data quality vector q = (4.5, 2.5) is, from equation
(8):

$$Q = 2(4.5) + 0.5(2.5) = 10.25$$

From equation (2):

$$x = \%\ \text{of attainable quality} = \left( \frac{10.25 - 2.5}{12.5 - 2.5} \right) \times 100$$

$$x = 77.5\%$$

Therefore, the aggregate DQI for the (4.5, 2.5) data quality
vector is 4 from Table 2.

4.2 Proof of equivalent analysis methods

As before, the ratio of the expected value of the weighted
data quality vector to the maximum attainable weighted
data quality range is equivalent to the LP analysis method.
In this case, the weighted mean, $\bar{q}_w$, must be determined
and used in the percent of total quality range computation
as above. Hamburg (1983) defines the weighted mean as:

$$\bar{q}_w = \frac{\sum_{i=1}^{n} w_i q_i}{\sum_{i=1}^{n} w_i} \qquad (9)$$

Therefore, from equation (9):

$$
\begin{aligned}
\%\ \text{of attainable quality}
&= \frac{\bar{q}_w - 1}{5 - 1} \times 100
 = \frac{\dfrac{\sum_{i=1}^{n} w_i q_i}{\sum_{i=1}^{n} w_i} - 1}{4} \times 100 \\
&= \left( \frac{\sum_{i=1}^{n} w_i q_i}{4 \sum_{i=1}^{n} w_i} - \frac{\sum_{i=1}^{n} w_i}{4 \sum_{i=1}^{n} w_i} \right) \times 100 \\
&= \frac{\sum_{i=1}^{n} w_i q_i - \sum_{i=1}^{n} w_i}{4 \sum_{i=1}^{n} w_i} \times 100
\end{aligned}
\qquad (10)
$$

The LP approach requires further development to complete
the proof. From equations (2) and (8):

$$
\begin{aligned}
\%\ \text{of attainable quality}
&= \frac{\sum_{i=1}^{n} w_i q_i - \sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} 5 w_i - \sum_{i=1}^{n} w_i} \times 100 \\
&= \frac{\sum_{i=1}^{n} w_i q_i - \sum_{i=1}^{n} w_i}{5 \sum_{i=1}^{n} w_i - \sum_{i=1}^{n} w_i} \times 100 \\
&= \frac{\sum_{i=1}^{n} w_i q_i - \sum_{i=1}^{n} w_i}{4 \sum_{i=1}^{n} w_i} \times 100
\end{aligned}
\qquad (11)
$$

Note that equations (10) and (11) are identical. Also, as
before, this proof is based on the one to five sliding scale
with five representing the best case.
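
The equivalence can also be checked numerically. The following sketch, written under the same one-to-five scale assumption, computes the percentage once from the weighted mean (equation (10)) and once from the LP range (equation (11)), then compares the two for the Section 4.1 example vector and for randomly generated vectors. The function names and the random test data are illustrative assumptions.

```python
import random


def pct_from_weighted_mean(q, w):
    """Weighted mean of the scores, placed on the 1-to-5 range (equation (10))."""
    mean_w = sum(wi * qi for wi, qi in zip(w, q)) / sum(w)
    return (mean_w - 1.0) / (5.0 - 1.0) * 100.0


def pct_from_lp_range(q, w):
    """Q relative to the LP minimum and maximum of Q (equation (11))."""
    big_q = sum(wi * qi for wi, qi in zip(w, q))
    q_min = sum(wi * 1.0 for wi in w)  # all scores at the worst case
    q_max = sum(wi * 5.0 for wi in w)  # all scores at the best case
    return (big_q - q_min) / (q_max - q_min) * 100.0


# Example from Section 4.1: q = (4.5, 2.5), w = (2, 0.5).
example_mean = pct_from_weighted_mean((4.5, 2.5), (2.0, 0.5))
example_lp = pct_from_lp_range((4.5, 2.5), (2.0, 0.5))
print(example_mean, example_lp)

# Random vectors with positive weights; both routes should agree
# up to floating-point rounding.
rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    n = rng.randint(1, 8)
    q = [rng.uniform(1.0, 5.0) for _ in range(n)]
    w = [rng.uniform(0.1, 10.0) for _ in range(n)]
    gap = abs(pct_from_weighted_mean(q, w) - pct_from_lp_range(q, w))
    max_gap = max(max_gap, gap)
print(f"largest disagreement: {max_gap:.2e}")
```

A numerical check of this kind does not replace the algebraic proof; it only guards against transcription errors when the formulas are implemented.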

5 Conclusions 

The DQI development extension methodology presented
does not require complex mathematical analyses. The increased
precision (i.e., fractional estimates) in the data quality
indices provided for the LCA models presented in
Kennedy et al. is directly applicable and does not contribute
to increased cost or loss of effectiveness. To the contrary,
for those practitioners inclined to use fractional estimates
as a means to quantify further resolution in the
uncertainty of the data, the increased precision results in a
more fine-tuned uncertainty analysis. As yet, the effect on
the precision of individual input data elements is difficult
to discern. With thousands of input variables, it is a difficult
task to evaluate the main effects and interaction effects
up to the nth order. Therefore, LCA practitioners should be
encouraged to provide as much precision in the DQI’s as
they feel is reasonable to assess.

Funtowicz and Ravetz report that in their experience, there 
is a remarkable degree of agreement on the pedigree ratings 
among experts within the general area of competence. This 
is attributed to the fact that once the modes are well de¬ 
fined and understood by the evaluators, there is very little 
room for disagreement. Therefore, the pedigree is consid¬ 
ered to be a robust indicator of the strength of the associ¬ 
ated NUSAP assessment. Likewise, the LCA practitioners 
that assigned the DQI’s to the data in the models presented 
in Kennedy et al. were in general agreement on the approach 
and the outcome once an understanding of the sliding scale 
was achieved among the raters. It is expected that the ro¬ 
bust nature of the pedigree and single-valued DQI’s will 
also be characteristic of the data quality vector. 

As an alternative to the weighted method, LCA practitioner(s)
might prefer to establish more detailed constructs
for the individual data quality attribute indices in an attempt
to account for preferences between attributes. However,
the weighted data quality attribute approach is more
efficient and effective. The decision on what weights to use
is made once and does not require stronger quality
attribute constructs, which may be difficult to develop such
that the entire data set is adequately represented. In addition,
the weighted data quality attribute approach guarantees
that equivalence is maintained across all LCA model input
data for all associated attributes. By contrast, the individual
scoring of each data element, even using a more detailed
scoring construct, requires the quantification of additional
LCA practitioner subjectivity across all data elements. This
naturally leads to a reduction in the equivalence of quality
assessments between data.

The main benefit to evaluating the problem using the linear
programming approach is the ability to readily determine,
as in the weighted case, which quality component(s) to improve
to minimize uncertainty (maximize quality) in any
particular data element. This is done by analyzing the
shadow prices (quality margin values). Since the quality
function is additive, most of this analysis can be done by
inspection of the quality function once the LCA practitioner
gains a thorough understanding of the concept and becomes
skilled at interpreting these models. The LP approach is
also useful for evaluating any special cases of data quality
vectors that may generate the need to develop and evaluate
more complex quality objective functions.
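
Because the quality function is additive, the margin value of each component equals its weight, and the most that component i can still add to Q is $w_i (5 - q_i)$. The sketch below ranks components by that attainable gain. It is one possible reading of the inspection described above: the ranking rule and the function name `improvement_ranking` are assumptions for illustration and are not taken from the source.

```python
def improvement_ranking(q, w, hi=5.0):
    """Rank attribute indices by the largest possible increase in Q.

    For Q = sum(w_i * q_i), raising q_i by one unit adds w_i to Q, so the
    most that attribute i can still contribute is w_i * (hi - q_i).
    Returns a list of (index, marginal_value, attainable_gain) tuples,
    largest attainable gain first.
    """
    rows = [(i, wi, wi * (hi - qi)) for i, (qi, wi) in enumerate(zip(q, w))]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Example vector and weights from Section 4.1.
ranking = improvement_ranking((4.5, 2.5), (2.0, 0.5))
for index, marginal, gain in ranking:
    print(f"attribute {index + 1}: marginal value {marginal}, attainable gain {gain}")
```

For this vector the lower-weighted second attribute offers slightly more attainable gain (1.25 versus 1.0) because its score is further from five, although each unit of improvement in the first attribute is worth four times as much.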

The LCA practitioner can be assured that regardless of the 
spread indicated by the data quality vector components, 
the ratio of the arithmetic average of the vector compo¬ 
nents to the total quality range attainable provides the ap¬ 
propriate measure of aggregate quality. This also applies in 
the weighted average case. The aggregation methods pre¬ 
sented are also easily adapted to other numerical rating 
scales the LCA practitioner may prefer. The resultant ag¬ 
gregate DQI’s map directly to the DQI’s used for stochastic 
LCA modeling. 
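
Adapting the aggregation to another rating scale only changes the endpoints used in the range computation. As a sketch, assuming a scale that runs from `lo` (worst) to `hi` (best) and weights with a positive sum, the percentage becomes $(\sum w_i q_i - lo \sum w_i) / ((hi - lo) \sum w_i) \times 100$. The function name and the one-to-ten example are hypothetical.

```python
def percent_on_scale(q, w, lo, hi):
    """Percent of attainable weighted quality on a scale from lo (worst) to hi (best)."""
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    total_w = sum(w)
    if total_w <= 0:
        raise ValueError("weights are assumed to sum to a positive value")
    big_q = sum(wi * qi for wi, qi in zip(w, q))
    return (big_q - lo * total_w) / ((hi - lo) * total_w) * 100.0


# The Section 4.1 example on the original one-to-five scale.
on_five = percent_on_scale((4.5, 2.5), (2.0, 0.5), lo=1.0, hi=5.0)

# A hypothetical one-to-ten scale with evenly weighted attributes.
on_ten = percent_on_scale((7.0, 4.0), (1.0, 1.0), lo=1.0, hi=10.0)
print(on_five, on_ten)
```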